Abstract

The Roma people, living throughout Europe, are a diverse population 
linked by the Romani language and culture. Previous linguistic and genetic
studies have suggested that the Roma migrated into Europe from South 
Asia about 1,000-1,500 years ago. Genetic inferences about Roma history 
have mostly focused on the Y chromosome and mitochondrial DNA. To 
explore what additional information can be learned from genome-wide data, 
we analyzed data from six Roma groups that we genotyped at hundreds of 
thousands of single nucleotide polymorphisms (SNPs). We estimate that 
the Roma harbor about 80% West Eurasian ancestry — deriving from a 
combination of European and South Asian sources — and that the date of 
admixture of South Asian and European ancestry was about 850 years ago. 
We provide evidence for Eastern Europe being a major source of European 
ancestry, and North-west India being a major source of the South Asian 
ancestry in the Roma. By computing allele sharing as a measure of linkage 
disequilibrium, we estimate that the migration of Roma out of the Indian
subcontinent was accompanied by a severe founder event, which we
hypothesize was followed by a major demographic expansion once the 
population arrived in Europe. 

Authors Summary 

Inferences of history based on autosomal genetic markers can provide precise 
information about a population's history. To characterize the history of the Roma 
gypsy population, we applied genomic methods based on allele frequency 
correlations, linkage disequilibrium, and identity-by-descent sharing. We provide 
formal evidence that the Roma have ancestry from West Eurasians and South 
Asians, with the likely sources related to Eastern Europeans and North-west 
Indians respectively. We estimate that the major gene exchange occurred about 
850 years ago, soon after the exodus of Roma out of the Indian sub-continent. 
The migration out of India was accompanied by a severe founder event, 
signatures of which have been preserved for hundreds of years because of the 
endogamy prevalent in the Roma community. 
Introduction

The Roma (also called Romani or Gypsies) represent a unique and
diverse European population. They speak more than 60 dialects of a rapidly 
evolving language called Romani and belong to different social and religious 
groups across Europe. Their census size has been estimated to be in the range 
of 10-15 million[1], with the largest populations in Eastern Europe[2]. They do not 
have written history or genealogy (as Romani does not have a single convention 
for writing) and thus most of the information about their history has been inferred 
based on linguistics, genetics and historical records of the countries where they 
have resided. 

Previous studies have suggested that the Roma are originally from India, 
and that they migrated to Europe between the 5th and 10th century[3]. It has been
argued that their migration route included Persia, Armenia, Anatolia and 
Greece[3,4]. The Roma then settled in multiple locations within Europe and
descendants of these migrants mostly live in the Balkans, Spain and Portugal
today. By the 15th century, the Roma were present in almost all parts of
Europe[5]. 

Anthropological studies of the Roma have documented striking similarities 
between the cultures of various Indian groups and Roma. Social structure in 
Roma groups is similar to the castes of India, where the groups are often defined 
by profession[2,3]. Like many Indian populations, the Roma practice endogamy 
and individuals of one Roma clan (sub-ethnic group) preferentially marry within 
the same group, and marriages across clans are proscribed[3]. Many studies 
have also suggested a link between the Roma and Banjara (the wandering gypsy 
tribes of India) currently residing in central and southern India[3]. Linguistic
analysis shows that Banjari or Lamani, the languages spoken by the Indian gypsies, have
little similarity to Romani[6]. Linguistic and genetic studies, however, have
provided strong evidence for the origin of Roma in India. Y-chromosome marker 
H1a-M82 and mitochondrial haplogroup M35b, both thought to be characteristic 
of South Asian ancestry, are present at high frequency in Roma populations[7,8]. 
These studies have also documented that since their migration into Europe, the
Roma have admixed with neighboring European and West Asian 
populations[7,8]. However, there is no consensus about the specific ancestral 
group or geographic region within South Asia that is related to the ancestral
population of the Roma. Comparative linguistics suggests that Northwestern 
Indian languages like Punjabi or Kashmiri or Central Indian languages like Hindi 
are most closely related to Romani[9,10]. A recent study based on Y-chromosome
markers suggests that the Roma descended from Southern Indian
groups[11], which is contradictory to previous reports based on mtDNA 
haplogroups that have placed the origin of Roma in North-west India. While 
mtDNA and Y chromosome analyses provide valuable information about the 
maternal and paternal lineages, a limitation of these studies is that they represent 
only one instantiation of the genealogical process. Autosomal data permits 
simultaneous analysis of multiple lineages, which can provide novel information 
about population history. 

Here we have analyzed whole genome SNP array data from 27 Roma 
samples belonging to six groups that were sampled from 4 countries in Europe 
(three separate ethnic groups from Hungary, and one group each from Romania, 
Spain and Slovakia). Our aim was to address the following questions: (1) What is 
the source of the European and South Asian ancestry in the Roma? (2) What is 
the relationship of the Roma to the present-day South Asian populations? (3) Do 
present-day Punjabis or South Indians best represent the ancestral South Asian 
component of Roma? (4) What is the proportion and timing of the European gene 
flow? (5) Can we identify founder events or detect genetic signatures of 
endogamy? 

Results 

Genome-wide ancestry analysis of the Roma

We applied Principal Component Analysis (PCA) using the SMARTPCA 
software[12] and the clustering algorithm ADMIXTURE[13] to study the 
relationship of Roma to other worldwide populations in a merged dataset of
Roma and HapMap3 populations. In PCA, the Roma fall between the South
Asians (Gujaratis) and Europeans, consistent with their having both South Asian 
and European ancestry and in line with previous mtDNA and Y chromosome 
analyses[7,8] (Figure 1). The ADMIXTURE software, which implements a 
maximum likelihood method to infer the genetic ancestry of each individual 
modeled as a mixture of K ancestral groups, produces very similar 
inferences[13]. The resulting clustering plot for K=6 is shown in Figure 1 and for 
other K values in Figure S1. At K=6, we observe clustering based on major 
continental ancestry. Similar to the PCA results, the Roma individuals cluster with 
South Asians and Europeans (Figure 1). Based on the PCA and ADMIXTURE 
analysis, we excluded three Roma outlier samples from further analyses, as they 
appeared to have very recent admixture from neighboring non-Roma European 
populations (likely in the past few generations). We also examined pairwise 
average allele frequency differentiation (Fst) between Roma and major
continental groups (see Table S1). 

Previous studies have shown that most present-day South Asian populations
trace their ancestry to two major ancestry components: one related to West
Eurasians (referred to as Ancestral North Indians (ANI)) and the other related to
the indigenous Andamanese population (Onge) (referred to as Ancestral South Indians (ASI))[14].
This mixture is pervasive in South Asia and signatures of this mixture are present 
in all caste and social groups and in speakers of Indo-European and Dravidian 
languages[15]. As the Roma trace their ancestry to similar ancestral populations, 
we performed PCA to study the relationship of Roma with the present-day South 
Asians and HapMap populations. We observed that like all South Asians, the 
Roma also fall on the "Indian-cline" (which refers to the differential pattern of 
relatedness of South Asians to Europeans). However, they have much higher 
proportion of European ancestry compared to any other South Asian group 
(Figure 1c). 
We performed a 4 Population Test[15] to test formally if the Roma have evidence
of a mixture of European and South Asian ancestry. We used individuals of 
Northern European ancestry (CEU) and Andamanese as surrogates for the 
European and South Asian ancestral populations. We tested whether the 
phylogenetic tree (Africans, Europeans, South Asians, Roma) is consistent with 
the data. We chose Onge for this analysis, since, unlike their distant relatives on
the Indian mainland, they do not have any West Eurasian related admixture[15]. 
Applying the 4 Population Test to each of three simple phylogenetic trees that 
could potentially relate the four groups, we observed highly significant violations 
of the expected phylogenetic tree topology, confirming that the Roma are 
admixed; that is, they have ancestry from both South Asians and Europeans 
(Table S2). We note that this test does not distinguish between European and 
West Asian ancestry and hence we refer to this ancestry component as West
Eurasian ancestry.

To quantify the magnitude of the South Asian and West Eurasian ancestry in the 
Roma, we applied Ratio Estimation[15], which can estimate admixture 
proportions in the absence of data from accurate ancestral populations. This test 
estimates the excess of West Eurasian-related ancestry in Roma compared to an 
Onge (who have no known West Eurasian related ancestry[16]). Applying the 
Ratio Estimation to Roma assuming the tree shown in Figure S2, we estimate 
that the Roma have 77.5 ± 1.8% West Eurasian related ancestry (standard errors 
were computed using a Block Jackknife with a block size of 5cM) (Table S2). We 
note that some of the West Eurasian related ancestry we detect likely derives 
from India itself — from the ANI — while other parts may derive from a European 
mixture (post exodus from India). 

Estimating a date of European admixture in the Roma

To estimate the timing of the admixture event, we applied a modified version of 
ROLLOFF[17], which uses the decay of admixture linkage disequilibrium (LD) to
estimate the time of gene flow. ROLLOFF computes SNP correlations in the 
admixed population and weights the correlations by the allele frequency
difference in the ancestral populations such that the signal is sensitive to 
admixture LD. While this method estimates accurate dates of admixture in most 
cases, we observed that it is noticeably biased in case of strong founder events 
post admixture (Table S3). The bias is related to a normalization term that 
exhibits an exponential decay behavior in the presence of a strong founder 
event, thus confounding the admixture date (see details in Note S1, Figure S3). 
We propose a modification to the ROLLOFF statistic that removes the bias (Note 
S1, Table S3). In addition, the new statistic computes covariance instead of 
correlation between SNPs; this does not affect the performance of the method 
but makes it mathematically more tractable. Throughout the manuscript, we use 
the modified ROLLOFF statistic (R(d)) unless specified otherwise. Simulations
show that this statistic gives accurate and unbiased results up to 300 generations 
(Note S2, Figure S4). 

A feature of our method is that it uses allele frequency information in the 
ancestral populations to amplify the admixture signal relative to background LD. 
While data from the ancestral populations is not available for Roma, this 
information can be obtained by performing PCA using the present day 
Europeans and South Asians. Simulations show that PCA-based SNP loadings
effectively capture the allele frequency differentiation between the ancestral
populations and can be used for estimating dates of mixture (Note S2, Figure 
S5). 

Applying the modified ROLLOFF statistic to the Roma samples with the SNP 
loadings estimated using PCA of Europeans (CEU) and 16 Indian groups, we 
estimate that the West Eurasian admixture in Roma occurred 29 ± 2 generations 
or about 780-900 years ago, assuming one generation = 29 years[18]
(Figure 2). This is consistent with mixture having occurred only after the
historically recorded arrival of the Roma in Europe between 1,000-1,500 years
ago[3]. A potential complication is that the date we are estimating may also be
reflecting earlier admixture of ANI and ASI ancestry in India itself. However,
when we fit the decay of admixture LD using two exponential distributions to
accommodate this possibility, we obtain dates of 37 and 4 generations. The older
date corresponds to about 1,000 years before present - again consistent with the
historical record - and both dates are much more recent than any estimates 
obtained by applying ROLLOFF in India. This suggests that the admixture we are 
detecting is genuinely related to events in Europe. 
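
The conversion from generations to calendar time used above can be checked with a few lines of arithmetic. The sketch below assumes only the 29 years per generation stated in the text; the function and variable names are ours and purely illustrative.

```python
YEARS_PER_GENERATION = 29  # generation time assumed in the text [18]

def generations_to_years(generations, se=0, years_per_gen=YEARS_PER_GENERATION):
    """Return (low, point, high) in years for an estimate of `generations` +/- `se`."""
    low = (generations - se) * years_per_gen
    high = (generations + se) * years_per_gen
    return low, generations * years_per_gen, high

# Single-exponential fit: 29 +/- 2 generations.
single_pulse = generations_to_years(29, se=2)
# Two-exponential fit: 37 and 4 generations (no standard errors are quoted).
older_pulse = generations_to_years(37)
recent_pulse = generations_to_years(4)
```

Under this assumption, 29 ± 2 generations spans 783 to 899 years, matching the quoted range of about 780-900 years, and 37 generations corresponds to 1,073 years, i.e. about 1,000 years before present.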

Relationship with the host European populations 

To learn about the relationship of the Roma with neighboring European 
populations, we estimated the pairwise Identity-by-descent (IBD) sharing 
between each Roma individual and non-Roma individuals sampled from the 
respective countries (Slovakia (n = 1), Romania (n = 14), Hungary (n = 19) and
Spain (n = 137)). IBD segments (>3 centimorgans (cM)) were detected using
GERMLINE[19]. The output of GERMLINE was used to compute an average 
pairwise sharing distance between Roma from each geographic region and the 
host populations from that region (Figure S6). We observe that Roma exhibit the 
highest IBD sharing with individuals from Romania (Figure 3a). When we perform 
stratified analysis (where each Roma group is considered separately), we 
observe that the highest sharing is with Romania or Slovakia, consistent with the 
hypothesis that the admixture involved populations in Eastern Europe. However, 
we have very limited samples from some populations here, hence it would be
important to repeat this analysis with more samples.

Source of the South Asian ancestry in Roma 

To learn about the source of the South Asian ancestry in Roma, we inferred the 
pairwise IBD sharing distance between Roma and various Indian groups, using 
GERMLINE to compute an average pairwise sharing distance between Roma 
and 28 South Asian populations (24 Indian groups from the India Project, Pathan 
and Sindhi from HGDP and Punjabi and Gujarati from POPRES). To simplify the 
analysis, we classified the samples into 8 groups based on geographical regions 
within India: North (n = 38), Northwest (n = 235), Northeast (n = 8), Southwest (n
= 16), Southeast (n = 59), East (n = 11), West (n = 42) and Andamanese (n =
16). We observed that the Roma share the highest proportion of IBD segments 
with groups from the Northwest (Figure 3b). Interestingly, the two populations in 
our sample that show the highest relatedness to Roma (Punjabi, Kashmiri
Pandit) are also the populations that have the highest proportion of West Eurasian-related
(ANI) ancestry. To control for the possibility that the high IBD sharing
could be an artifact related to high ANI ancestry, we recalculated the IBD sharing 
regressing out the ANI ancestry proportion and observed that the Roma continue 
to share the highest IBD segments with the northwest Indian group (Note S3). 
These findings are consistent with analyses of mtDNA that also place the most 
likely South Asian source of the Roma in Northwest India[8].

An important caveat is that we have a large discrepancy in the number of samples
available from different regions in India. In order to control for the sample sizes,
we performed a bootstrap analysis, drawing a random sample of up to 30 individuals
from each Indian group and recomputing the IBD statistics. We repeated the
process 100 times and estimated the mean and standard error. We observed
that Roma still have the highest IBD segments with Northwest Indian groups. 
However, it is interesting to note that there is very little variability across the 100 
runs, suggesting that we are perhaps picking up shared signals of selection or 
founder events between Roma and Indian groups (Note S3, Figure S7). 
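
The resampling procedure described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that the draws are made without replacement and that each Indian individual is summarized by a single hypothetical number (their mean IBD sharing with the Roma samples). The function name and the input format are ours.

```python
import random
import statistics

def subsample_mean_sharing(sharing_by_individual, max_n=30, n_rep=100, seed=0):
    """Repeatedly draw up to `max_n` individuals from one group and average
    their sharing values; return (mean, standard error) across replicates.

    `sharing_by_individual` maps an individual ID to that individual's mean
    IBD sharing with Roma (hypothetical input format).
    """
    rng = random.Random(seed)
    ids = sorted(sharing_by_individual)
    k = min(max_n, len(ids))  # groups smaller than max_n are used in full
    replicate_means = []
    for _ in range(n_rep):
        chosen = rng.sample(ids, k)  # assumption: sampling without replacement
        replicate_means.append(
            statistics.mean(sharing_by_individual[i] for i in chosen)
        )
    # The spread of the replicate means serves as the standard error.
    return statistics.mean(replicate_means), statistics.stdev(replicate_means)
```

Note that under the without-replacement assumption made here, any group with 30 or fewer individuals is drawn in full on every replicate, so its replicate means are identical and the reported standard error is zero.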

Characterizing the founder events 

Previous genetic and social studies have shown that the present day Roma 
population has descended from a small number of ancestors with subsequent 
genetic and cultural isolation[8,20]. A history of founder events in a population 
can lead to an increase in homozygosity and large stretches of allele sharing 
across individuals within the same population. This can be measured by 
estimating the proportion of the autosomal genome that has homozygous 
genotypes. We applied PLINK v1.07[21] to compute a genomic measure of 
individual autozygosity for all Roma individuals and 30 random individuals from
each HapMap population. PLINK uses a sliding window approach to find regions
of the genome that are at least 1 Mb in length and contain 100 contiguous
homozygous SNPs. For each individual, we computed the overall length of the
autozygous segments and observed that the Roma have a very high level of
autozygosity compared to other HapMap populations (Figure 4a). 

To estimate the date of the founder event in Roma, we computed a distance 
based statistic that measures allele sharing as reported in Reich et al (2009)[16]. 
This method is based on computing the autocorrelation of allele sharing between 
pairs of individuals from one group, and then subtracting the cross-population 
autocorrelation to remove the effects of ancestral allele sharing inherited from the 
common ancestor. By measuring the exponential decay of auto-correlation with 
genetic distance, we obtain an estimate of the age of the founder event. 
Simulations have shown that this method can accurately estimate the dates of 
recent founder events even in the presence of admixture (Note S4). 

Applying this method to Roma and subtracting the shared Roma and European 
(CEU) autocorrelation, we estimate that a Roma founder event occurred 27 
generations or ~800 years ago (assuming one generation = 29 years[18]) (Figure
4b). This is consistent with reports that the Roma exodus from India occurred
1,000 years ago[3], and suggests that the migration out of the Indian sub-continent
may have been associated with a significant founder event in which a
small number of ancestral individuals gave rise to the present-day Roma 
population. 
Discussion



Using genome-wide SNP data from 27 Roma individuals, we have 
provided (1) confirmation of previous mtDNA and Y chromosome results with 
autosomal data, and (2) some new insights that take special advantage of 
autosomal data. 

We have performed formal tests to confirm that Roma are admixed and 
have ancestry from two highly divergent populations: a West Eurasian population 
and a South Asian population. We estimate that the Roma have ~80% West
Eurasian ancestry, reflecting a combined estimate of the ANI ancestry that the
Roma derive from their South Asian ancestors (pre-exodus) and the European 
ancestry related to the admixture in Europe (post-exodus to Europe). Our 
estimate is broadly consistent with admixture proportions estimated using 
autosomal short tandem repeats (66-100%)[22]. We only obtain a combined
estimate for the West Eurasian ancestry and so our estimate of 76 ± 4% West 
Eurasian ancestry in the Spanish Roma is not discrepant with the estimates of 
European ancestry (post-exodus only) of 30% based on mtDNA markers and 
37% based on Y chromosome markers reported previously[8,23].

Our identity-by-descent analysis provides novel insights related to the 
source of the ancestral populations of Roma. We provide evidence for Eastern 
Europe being a major source of European ancestry, and North-west India being a 
major source of the South Asian ancestry in the Roma. Our inferences about the 
geographic origin within South Asia help resolve a long-standing debate related
to the origin of the Romani people. Our results are consistent with reports from 
linguistics[9] and mtDNA studies[8], which have shown that present day 
Northwest Indian populations (from Kashmir and Punjab), are good candidates 
for being the source of the Indian ancestry in Roma[8,23]. However, we caution 
that IBD based methods require large sample sizes from the source and target 
populations. Hence, a larger sample size will increase the power to detect subtle 
differences between geographic regions. 

A historically informative insight from our analysis is the date of the West
Eurasian gene flow into Roma. Using a statistic that captures the pattern of
admixture related linkage disequilibrium, we estimate that the admixture between
Roma and West Eurasians occurred 29 ± 2 generations or about 780-900 years 
ago (assuming one generation = 29 years[18]). The earliest records of the arrival 
of Roma in the Balkans date back to the 11th-12th century[3], which is concordant
with our estimated date of mixture[3]. It is important to note that the Roma have
ancestry from both ANI and Europeans and thus the estimated date of admixture
with Europeans (post exodus) is slightly downward biased (older). Simulations
have shown that in the case of two gene flow events, the date of admixture estimated
by ROLLOFF tends to reflect the date of the recent gene flow event, if the 
interval between the two events is sufficiently large (Table S4, Note S2). 

Disease mutation screening in the Roma has shown that they have an 
increased proportion of private mutations[20]. For example, deletion 1267delG 
that causes a neuromuscular disorder, congenital myasthenia, has a high carrier 
frequency in many Roma groups that reside in different parts of Europe and 
speak different languages. In addition to the Roma groups, this mutation has only 
been observed in South Asian populations before[20,24]. This provides evidence 
that the different Roma groups have a history of a shared founder event. In order 
to obtain temporal information of the founder event that has likely increased the 
frequency of such disease causing mutations, we studied LD based allele 
sharing statistics and estimated that the founder event in Roma occurred about 
27 generations, or 800 years, ago. This agrees with previous reports from Morar 
et al. (2004)[24] who hypothesize that the entire Roma population was founded 
about 32-40 generations ago. 

Our results have confirmed that the Roma have ancestry from South 
Asians and West Eurasian populations, with mixture occurring around 30 
generations ago. An important opportunity for future work is to perform 
homozygosity mapping in Roma that can aid in finding disease-causing 
mutations related to the founder events. In addition, it would be illuminating to 
study the relationship of the Roma with other gypsy populations especially the 
Banjara from India. This may provide new insights into the history of Roma and 
perhaps help to elucidate the historical reasons for their exodus. 
Materials and Methods



Datasets: We collected 27 Roma samples belonging to six groups that were
sampled from four countries in Europe: Hungary (3 linguistically and
culturally separated sub-groups: 7 samples from Olah (Vlah), 4 samples from
Beas (Boyash) and 4 samples from Romungro), Romania (4 samples), Spain (4
samples) and Slovakia (4 samples; Slovakian speaking Roma).
All research involving human participants was approved by the Regional Ethics 
Committee Board (REKEB) and the Hungarian National Ethics Committee (ETT 
TUKEB). Each study participant attended a 45-60 minute verbal orientation session
about the study design and goals and then provided written informed consent. All 
the research was conducted according to the principles expressed in the 
Declaration of Helsinki. Roma individuals self-reported as being descendants of 
the same tribe for at least three generations. The samples were genotyped using 
an Affymetrix 1M SNP chip. We required < 5% missing genotype rate per sample 
per SNP to be included in the analysis (27 individuals, 726,404 SNPs passed this 
threshold). These data were merged with data from four other sources, including 
the International Haplotype Map Phase 3 (HapMap3) (n=1,115 samples from 11 
populations genotyped on Affymetrix 1M array)[25], the CEPH-Human Genome 
Diversity Panel (HGDP) (n = 257 individuals from 51 populations genotyped on 
Affymetrix 500K SNP array)[26,27], our previous study of Indian genetic variation 
which we call the "India Project" in this paper (n = 132 individuals from 25 groups 
genotyped on an Affymetrix 1M SNP array)[15] and the Population Reference 
Sample (POPRES) (n = 3,845 individuals from 37 European populations 
genotyped on an Affymetrix 500K SNP array)[28]. 

Population Structure Analysis and Fst calculation: We created a merged 
dataset of Roma and HapMap3 populations (n = 1,142 and 853,727 SNPs). As
background LD can affect both PCA and ADMIXTURE analysis, we thinned the
marker set by excluding SNPs in strong LD (pairwise genotypic correlation r^2 >
0.1) in a window of 50 SNPs (sliding the window by 5 SNPs at a time) using 
PLINK v1.07[21]. The thinned dataset contained 61,052 SNPs. We used 
SMARTPCA[12] to carry out PCA and to compute Fst values. Clustering analysis
was performed using ADMIXTURE[13]. 

Formal tests of population mixture: To test if Roma have West Eurasian and 
Indian ancestry, we used the unrooted phylogenetic tree ((YRI, CEU), (Onge, 
Roma)) and computed the 4-population test statistic for all three phylogenetic 
trees that can possibly relate these populations. For this analysis, we created a 
merged dataset of Roma, India project and HapMap3 populations (n = 1,274 and
524,053 SNPs). Let $\mathrm{YRI}_i$, $\mathrm{CEU}_i$, $\mathrm{Onge}_i$ and $\mathrm{Roma}_i$ be the allele frequencies for
SNP $i$ in the populations YRI, CEU, Onge and Roma respectively. Then we
compute the correlation $\rho(\mathrm{YRI}_i - \mathrm{CEU}_i, \mathrm{Onge}_i - \mathrm{Roma}_i)$ for all SNPs across the genome. In the
absence of mixture, we would expect this correlation to be almost 0. Standard 
errors were computed using Block Jackknife[29,30] where a block of 5cM was 
dropped in each run. 
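
The correlation and its block jackknife standard error can be illustrated with a small, self-contained sketch. It is not the authors' implementation: it assumes equal-weight, delete-one-block jackknifing over pre-assigned block labels (standing in for 5 cM blocks), and it uses toy simulated frequencies rather than real data. All function names are ours.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def four_pop_correlation(yri, ceu, onge, roma):
    """Correlation of (YRI - CEU) with (Onge - Roma) across SNPs."""
    d1 = [a - b for a, b in zip(yri, ceu)]
    d2 = [a - b for a, b in zip(onge, roma)]
    return pearson(d1, d2)

def block_jackknife(stat, columns, block_ids):
    """Delete-one-block jackknife; returns (full estimate, standard error)."""
    blocks = sorted(set(block_ids))
    full = stat(*columns)
    leave_one_out = []
    for b in blocks:
        keep = [k for k, bid in enumerate(block_ids) if bid != b]
        leave_one_out.append(stat(*[[col[k] for k in keep] for col in columns]))
    g = len(blocks)
    mean = sum(leave_one_out) / g
    var = (g - 1) / g * sum((v - mean) ** 2 for v in leave_one_out)
    return full, math.sqrt(var)

def simulate(n_snps, admixed, seed=1):
    """Toy frequencies: four populations drifting from a shared ancestor.
    If `admixed`, Roma is 80% CEU-like and 20% Onge-like; otherwise Roma is
    a sister group of Onge. Values are not clamped to [0, 1]."""
    rng = random.Random(seed)
    yri, ceu, onge, roma = [], [], [], []
    for _ in range(n_snps):
        p = rng.uniform(0.2, 0.8)
        y, c, o = (p + rng.gauss(0, 0.05) for _ in range(3))
        r = (0.8 * c + 0.2 * o if admixed else o) + rng.gauss(0, 0.05)
        yri.append(y); ceu.append(c); onge.append(o); roma.append(r)
    return yri, ceu, onge, roma

blocks = [k // 100 for k in range(4000)]  # 40 blocks of 100 SNPs each
estimate, se = block_jackknife(four_pop_correlation, simulate(4000, True), blocks)
print("correlation = %.3f, Z = %.1f" % (estimate, estimate / se))
```

In the tree-like toy case the correlation stays near zero, whereas the admixed case yields a correlation many standard errors from zero, which is the kind of violation the test is designed to detect.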

Estimating genome-wide ancestry proportion: We estimate the genome-wide
proportion of ancestry using f4 Ratio Estimation[15], which estimates the excess
of European ancestry compared to an Onge. We use the phylogenetic tree
(YRI, (CEU, (Adygei, (Onge, Roma)))) as shown in Figure S2. It has been
shown previously that ANI form a clade with CEU and Onge form a clade with
ASI[15]. YRI and Adygei are used as outgroups in this analysis. Let $\mathrm{YRI}_i$, $\mathrm{CEU}_i$,
$\mathrm{Adygei}_i$, $\mathrm{Onge}_i$ and $\mathrm{Roma}_i$ be the allele frequencies for SNP $i$ in the populations
YRI, CEU, Adygei, Onge and Roma respectively. We compute the ratio
$f_4(\mathrm{YRI}_i, \mathrm{Adygei}_i; \mathrm{Roma}_i - \mathrm{Onge}_i) / f_4(\mathrm{YRI}_i, \mathrm{Adygei}_i; \mathrm{CEU}_i - \mathrm{Onge}_i)$. This quantity is summed
over all markers and the standard errors are computed using the Block Jackknife 
(block size of 5cM). To represent all the populations needed for this analysis, we 
created a merged dataset that included data from Roma, India project, HGDP 
and HapMap3 (n = 1,531 and 262,558 SNPs). 
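
As a rough illustration of the ratio described above, the sketch below assumes the usual definition of the f4 statistic as the average over SNPs of the product of two allele-frequency differences, f4(A, B; C, D) = mean[(a - b)(c - d)], which the text does not spell out. The simulated tree, drift magnitudes and function names are our own toy choices, not the paper's data or code.

```python
import random

def f4(a, b, c, d):
    """Mean over SNPs of (a - b) * (c - d) for allele-frequency lists."""
    return sum((ai - bi) * (ci - di) for ai, bi, ci, di in zip(a, b, c, d)) / len(a)

def f4_ratio(yri, adygei, ceu, onge, target):
    """Estimated West Eurasian-related ancestry proportion of `target`."""
    return f4(yri, adygei, target, onge) / f4(yri, adygei, ceu, onge)

def simulate_frequencies(n_snps, alpha, seed=7):
    """Toy tree: the target is a mix of a CEU-related lineage (proportion
    alpha) and Onge. Frequencies are not clamped to [0, 1]."""
    rng = random.Random(seed)
    cols = {name: [] for name in ("yri", "adygei", "ceu", "onge", "target")}
    for _ in range(n_snps):
        root = rng.uniform(0.3, 0.7)
        yri = root + rng.gauss(0, 0.03)
        eurasia = root + rng.gauss(0, 0.03)
        onge = eurasia + rng.gauss(0, 0.03)
        west = eurasia + rng.gauss(0, 0.10)   # drift shared by West Eurasians
        adygei = west + rng.gauss(0, 0.03)
        ceu_clade = west + rng.gauss(0, 0.03)
        ceu = ceu_clade + rng.gauss(0, 0.03)
        target = alpha * ceu_clade + (1 - alpha) * onge + rng.gauss(0, 0.03)
        for name, value in (("yri", yri), ("adygei", adygei), ("ceu", ceu),
                            ("onge", onge), ("target", target)):
            cols[name].append(value)
    return cols

sim = simulate_frequencies(20000, alpha=0.775)
estimate = f4_ratio(sim["yri"], sim["adygei"], sim["ceu"], sim["onge"], sim["target"])
print("estimated West Eurasian-related ancestry: %.3f" % estimate)
```

In this toy setting only the drift shared by the West Eurasian lineages contributes to either f4 term, so the ratio recovers the simulated mixing proportion up to sampling noise.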

GERMLINE analysis: IBD segments were detected using GERMLINE[19]. For
this analysis, we phased the data using Beagle[31] and then ran GERMLINE in
Genotype Extension mode on a combined dataset of Roma, HapMap3, India
Project, POPRES and HGDP (n = 5,376 and 205,710 SNPs). We applied the
following parameters for calculating IBD segments: seed size = 75, minimum IBD
segment length = 3 cM and the number of heterozygous or homozygous
errors = 0. The output of GERMLINE was used to compute an average pairwise
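sharing between populations I and J as previously reported in reference [32]:

$$\text{Average sharing}_{IJ} = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} \mathrm{IBD}_{ij}}{n \times m}$$

where $\mathrm{IBD}_{ij}$ is the length of the IBD segment shared between individuals i and j, and
n, m are the numbers of individuals in populations I and J.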
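
The average pairwise sharing (total IBD length shared across the two populations, divided by n × m) can be sketched as below. The input format is hypothetical: a list of (individual, individual, length in cM) tuples and a dictionary mapping individuals to population labels. It is not GERMLINE's output format, and the function name is ours.

```python
def average_pairwise_sharing(segments, pop_of, pop_i, pop_j, min_cm=3.0):
    """Total IBD length shared between individuals of two different
    populations, divided by (n * m).

    segments: iterable of (ind_a, ind_b, length_cm) tuples (hypothetical format)
    pop_of:   dict mapping individual ID -> population label
    """
    members_i = {ind for ind, pop in pop_of.items() if pop == pop_i}
    members_j = {ind for ind, pop in pop_of.items() if pop == pop_j}
    total = 0.0
    for ind_a, ind_b, length_cm in segments:
        if length_cm < min_cm:
            continue  # ignore segments below the minimum length
        crosses = ((ind_a in members_i and ind_b in members_j) or
                   (ind_a in members_j and ind_b in members_i))
        if crosses:
            total += length_cm
    return total / (len(members_i) * len(members_j))

pop_of = {"r1": "Roma", "r2": "Roma", "p1": "Punjabi", "p2": "Punjabi", "p3": "Punjabi"}
segments = [("r1", "p1", 4.0), ("p2", "r2", 6.0), ("r1", "r2", 10.0), ("r1", "p3", 2.0)]
sharing = average_pairwise_sharing(segments, pop_of, "Roma", "Punjabi")
```

In this toy example the within-Roma segment and the 2 cM segment are ignored, so 10 cM of cross-population sharing is divided by 2 × 3 pairs.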

Estimation of a date of mixture: We applied modified ROLLOFF[17] to
estimate the date of mixture in a combined dataset containing 1,274 individuals
and 524,053 SNPs. For each pair of SNPs (x,y) separated by a distance d 
Morgans, we compute covariance between (x,y). Specifically, we use the 
following statistic:

$$R(d) = \frac{\sum_{|x-y| \approx d} z(x,y)\, w(x,y)}{\sum_{|x-y| \approx d} w(x,y)^2}$$

where $z(x,y)$ is the covariance between SNPs (x, y) and $w(x,y)$ is a
weight function that can be the allele frequency difference between the ancestral
populations or the PCA based loadings for SNPs (x, y). We look at the
relationship of the weighted covariance with genetic distance, and obtain a date
by fitting an exponential function with an affine term $y = Ae^{-nd} + c$, where n is the
number of generations since admixture and d is the distance in Morgans.
Standard errors were computed using a Block Jackknife[29,30] where one 
chromosome was dropped in each run. 
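
The curve-fitting step (an exponential in genetic distance plus an affine term) can be illustrated with a dependency-free sketch. It assumes a simple strategy of our own choosing, not necessarily the one used by ROLLOFF: scan a grid of candidate values of n and, for each, solve for the amplitude A and offset c by ordinary least squares. The input is a toy noiseless curve.

```python
import math

def fit_exponential_decay(distances, values, n_grid):
    """Fit y = A * exp(-n * d) + c.

    For each candidate n, A and c have a closed-form least-squares solution
    because the model is linear in exp(-n * d). Returns (n, A, c) with the
    smallest residual sum of squares.
    """
    m = len(distances)
    y_mean = sum(values) / m
    best = None
    for n in n_grid:
        x = [math.exp(-n * d) for d in distances]
        x_mean = sum(x) / m
        sxx = sum((xi - x_mean) ** 2 for xi in x)
        if sxx == 0.0:
            continue  # degenerate candidate (for example n = 0)
        sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, values))
        a = sxy / sxx
        c = y_mean - a * x_mean
        rss = sum((yi - (a * xi + c)) ** 2 for xi, yi in zip(x, values))
        if best is None or rss < best[0]:
            best = (rss, n, a, c)
    _, n_hat, a_hat, c_hat = best
    return n_hat, a_hat, c_hat

# Toy curve: 29 generations, distances from 0.5 cM upward (in Morgans).
distances = [0.005 + 0.001 * k for k in range(200)]
values = [0.02 * math.exp(-29 * d) + 0.001 for d in distances]
n_grid = [0.5 * k for k in range(2, 601)]  # candidate dates: 1 to 300 generations
n_hat, a_hat, c_hat = fit_exponential_decay(distances, values, n_grid)
```

A grid search keeps the example free of external libraries; in practice a nonlinear least-squares routine would serve the same purpose, and the jackknife over chromosomes would repeat the fit with one chromosome removed at a time.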

Estimating individual autozygosity: We used PLINK v1.07[21] to identify 
autozygous segments in the genome in a combined dataset of 1,274 individuals
and 524,053 SNPs. PLINK uses a sliding window approach to find regions of the
genome that are at least 1 Mb in length and contain 100 contiguous
homozygous SNPs. We allowed one heterozygous and five missing calls per 
segment. Autozygous segments were identified separately for each individual. 
We computed the overall length of autozygous segments for each individual as
their measure of genomic autozygosity. We applied this method to compute
genomic autozygosity for Roma and HapMap individuals (n = 30 from each 
population). 
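
To make the length and SNP-count thresholds concrete, the sketch below scans one chromosome for runs of consecutive homozygous genotypes. It is a deliberate simplification and not PLINK's sliding-window algorithm: it tolerates no heterozygous or missing calls inside a run (stricter than the settings described above), and the genotype coding (0, 1, 2 copies of an allele, with 1 heterozygous) and function name are our assumptions.

```python
def autozygous_length(positions_bp, genotypes, min_bp=1_000_000, min_snps=100):
    """Total length in bp of qualifying homozygous runs on one chromosome.

    positions_bp: sorted SNP positions in base pairs
    genotypes:    0 or 2 = homozygous, 1 = heterozygous; any other value
                  (for example None for a missing call) also ends a run
    """
    total = 0
    start = None  # index of the first SNP in the current run
    # A trailing heterozygous sentinel closes a run that reaches the last SNP.
    for k, g in enumerate(list(genotypes) + [1]):
        if g in (0, 2):
            if start is None:
                start = k
        elif start is not None:
            n_snps = k - start
            span = positions_bp[k - 1] - positions_bp[start]
            if n_snps >= min_snps and span >= min_bp:
                total += span
            start = None
    return total

# Toy chromosome: 300 SNPs spaced 10 kb apart.
positions = [k * 10_000 for k in range(300)]
genotypes = [0] * 150 + [1] + [2] * 50 + [1] + [0] * 98
roh_bp = autozygous_length(positions, genotypes)
```

Here only the first run (150 SNPs spanning 1.49 Mb) passes both thresholds; the other two runs contain fewer than 100 SNPs. Summing this quantity across chromosomes would give a per-individual total comparable in spirit to the measure described above.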

Estimating a date of founder event: To estimate the date of the founder event, 
we compute the correlation of allele sharing as a measure of LD as described in 
reference [15] using a dataset containing Roma and HapMap3 populations (n =
1,142 and 853,727 SNPs). Specifically, we compute the autocorrelation of allele
sharing between pairs of individuals of one group, and then subtract the across-population
autocorrelation to remove the effects of ancestral allele sharing. We
thus get a measure for the population-specific LD. We plot the auto-correlation
against genetic distance and, by fitting the exponential function $y = Ae^{-2Dt} + c$, where
D = distance in Morgans and t = time of founder event, we estimate the age of
the founder event.
Figure Legends



Figure 1. Relationship of Roma with other worldwide populations. We
applied Principal Component Analysis (PCA) and the clustering algorithm
ADMIXTURE to study the relationship of Roma with the HapMap3 and South
Asian populations. Panel (a) shows the results for PCA of Roma and HapMap3
populations where each point represents an individual and the coloring is based
on the legend shown on the right. Panel (b) shows the results for ADMIXTURE
for K=6 for Roma and HapMap3 populations. Each vertical line represents an
individual colored in proportion to their estimated ancestry within each cluster.
Panel (c) shows the results of PCA of Roma, HapMap3 (CEU, CHB) and South
Asian populations. The population codes are as follows: Yoruba in Ibadan,
Nigeria (YRI), Luhya in Webuye, Kenya (LWK), Maasai in Kinyawa, Kenya (MKK),
Utah residents with Northern and Western European ancestry (CEU), Toscani in 
Italia (TSI), Han Chinese in Beijing, China (CHB), Japanese in Tokyo, Japan 
(JPT), Chinese in Metropolitan Denver, Colorado (CHD), Gujarati Indians in 
Houston, Texas (GIH), African ancestry in Southwest USA (ASW) and Mexican 
ancestry in Los Angeles, California (MEX). 

Figure 2. ROLLOFF Analysis of Roma. We performed ROLLOFF on the Roma
samples (n = 24). We plot the weighted covariance as a function of
genetic distance, and obtain a date by fitting an exponential function with an
affine term, $y = Ae^{-nd} + c$, where d is the genetic distance in Morgans and
n is the number of generations since mixture. We do not show inter-SNP
intervals of <0.5 cM since we have found that at this distance admixture LD
begins to be confounded by background LD.

Figure 3. Evidence for the European and South Asian sources of Roma
ancestry. We computed a genome-wide average IBD sharing distance between
Roma and other populations. All Roma samples were combined in one group.
Panel (a) shows average pairwise IBD sharing between Roma and Europeans
(non-Roma European individuals from the countries in which the Roma were
sampled), and panel (b) shows average pairwise IBD sharing between Roma and
individuals from India. Indians were grouped in eight regional categories as
follows: North includes Tharu, Kharia, Vaish, Srivastava, Sahariya, Lodi,
Pathan and Sindhi; Northwest includes Kashmiri Pandit and Punjabi; Northeast
includes Nyasha and Ao Naga; Southwest includes Kurumba and Hallaki;
Southeast includes Madiga, Mala, Vysya, Chenchu, Naidu, Velama and Kamsali;
West includes Bhil, Meghawal and Gujarat; East includes Santhal and Satnami;
and Andamanese includes Great Andamanese and Onge. A detailed description of
these populations can be found in reference [15].

Figure 4. Inferring founder events in the Roma. Panel (a) shows estimates of
genome-wide autozygosity in Roma and individuals from HapMap (n = 30 from
each population). Each point represents an individual colored based on the
legend shown below. Panel (b) shows the decay of autocorrelation with genetic
distance. We fitted an exponential function $y = Ae^{-2Dt} + c$, where D =
distance in Morgans and t = time of founder event, to estimate the time of the
founder event(s) as 27 generations.
NOTE S1. New ROLLOFF statistic

In this note we consider alternative forms of the ROLLOFF linkage disequilibrium
(LD) statistic [1] for dating population admixture events. We show that the
original ROLLOFF statistic is susceptible to downward bias in the event of a recent
population bottleneck, and we propose a modification of the statistic that is robust
against such an effect (Table S3).

The ROLLOFF technique applies two key insights: first, that admixture creates
LD that decays exponentially as recombination occurs (explicitly, as $e^{-nd}$,
where n is the number of generations since admixture and d is the genetic
distance between SNPs), and second, that the amount of admixture LD between
each pair of SNPs is proportional to the product of the allele frequency
divergences between the ancestral populations at those sites. The latter
observation allows the $e^{-nd}$ admixture LD decay signal to be detected (via a
SNP-pair weighting scheme) and harnessed to infer the mixture date n.

The original ROLLOFF statistic captures admixture LD in the form of SNP
autocorrelation. Defining z(x,y) to be the (Fisher z-transformed) correlation
coefficient between SNP calls at sites x and y, ROLLOFF computes the correlation
coefficient between values of z(x,y) and weights w(x,y) over pairs of SNPs
binned by genetic distance:

$$A(d) := \frac{\sum_{|x-y| \approx d} z(x,y)\, w(x,y)}{\sqrt{\sum_{|x-y| \approx d} z(x,y)^2 \, \sum_{|x-y| \approx d} w(x,y)^2}}, \qquad (1)$$

the idea being that $A(d) \propto e^{-nd}$.

While this setup estimates accurate dates for typical admixture scenarios, it
turns out to be noticeably biased in the case of a recent bottleneck. However, we
will show that the following modified statistic does not suffer from the bias:

$$R(d) := \frac{\sum_{|x-y| \approx d} z(x,y)\, w(x,y)}{\sum_{|x-y| \approx d} w(x,y)^2}. \qquad (2)$$

(Note that R(d) amounts to taking the regression coefficient of z(x,y) against the
weights w(x,y) for SNP pairs within each bin.)
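To make the difference between the two statistics concrete, the sketch below computes both for the SNP pairs that fall in one distance bin: the original form is a correlation coefficient (numerator divided by the norms of both z and w), while the modified form is a regression coefficient (numerator divided by the squared norm of w only). The function names and the representation of a bin as a list of (z, w) tuples are illustrative assumptions, not the ROLLOFF implementation.

```python
import math
from typing import Dict, Iterable, List, Tuple


def bin_pairs(
    pairs: Iterable[Tuple[float, float, float]],
    bin_width: float,
) -> Dict[int, List[Tuple[float, float]]]:
    """Group (distance, z, w) triples by genetic-distance bin index."""
    bins: Dict[int, List[Tuple[float, float]]] = {}
    for d, z, w in pairs:
        bins.setdefault(int(d // bin_width), []).append((z, w))
    return bins


def rolloff_original(zw: List[Tuple[float, float]]) -> float:
    """A(d): correlation-style statistic, normalized by |z| and |w|."""
    num = sum(z * w for z, w in zw)
    den = math.sqrt(sum(z * z for z, _ in zw) * sum(w * w for _, w in zw))
    return num / den


def rolloff_modified(zw: List[Tuple[float, float]]) -> float:
    """R(d): regression coefficient of z on w, normalized by |w|^2 only."""
    num = sum(z * w for z, w in zw)
    return num / sum(w * w for _, w in zw)
```

Multiplying every z in a bin by a constant leaves `rolloff_original` unchanged but scales `rolloff_modified` by that constant, which is why only the original form is sensitive to distance-dependent changes in the overall magnitude of z.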

An additional detail of our ROLLOFF variant is that we modify z(x,y) to
measure admixture LD as the covariance between SNPs x and y rather than the
correlation (i.e., it equals the classical LD statistic D rather than the
correlation r). We believe the use of covariance rather than correlation for
z(x,y) has little impact on the performance and properties of the statistic (as
it roughly amounts to multiplying by a constant factor) but makes the statistic
more amenable to mathematical analysis.

Explanation of bias from recent bottlenecks

The bias in the original formulation of ROLLOFF (1) introduced by a recent
bottleneck can be readily explained at an intuitive level: the problem is that
while the numerator of the correlation coefficient,
$\sum_{|x-y| \approx d} z(x,y)\, w(x,y)$, decays as $e^{-nd}$ as intended, the
normalization term

$$\sqrt{\sum_{|x-y| \approx d} z(x,y)^2} \qquad (3)$$

also exhibits a decay behavior that confounds the $e^{-nd}$ signal (Figure S3). The
reason is that a strong bottleneck introduces a very large amount of LD,
effectively giving z(x,y) a random large magnitude immediately post-bottleneck that
is independent of the distance between x and y. This LD subsequently decays
as $e^{-nd}$ until the magnitude of z(x,y) reaches the level of random sampling noise
(arising from the finite sample of admixed individuals being used to calculate z). In
non-bottlenecked cases, the square-norm of z(x,y) is usually dominated by
sampling noise, so the normalization term (3) effectively amounts to a constant, and
dividing out by it has no effect on the decay rate of A(d).

The "regression coefficient" version of the ROLLOFF statistic (2) does not
contain the normalization term (3) and thus does not incur bias from bottlenecks.
Precise effect of genetic drift on original and modified ROLLOFF
statistics

We now rigorously derive the above intuition. We will assume in the following
calculations that the ROLLOFF weights are taken as the product of allele frequency
divergences δ(x) and δ(y) in the ancestral mixing populations:

$$w(x,y) := \delta(x)\,\delta(y).$$

(Our reasoning below applies whether we have the true values of δ(x) and δ(y) or
compute weights based on related reference populations or PCA loadings,
however.) We also assume that all SNPs are polymorphic ancestrally (i.e., we ignore
mutations that have arisen in the admixed population) and that the SNP
ascertainment is unbiased with respect to the populations under consideration.

For a diploid population of size N with chromosomes indexed by i = 1, ..., 2N,
we set

$$z(x,y) := \frac{1}{2N} \sum_{i=1}^{2N} (X_i - \mu_x)(Y_i - \mu_y)$$

to be the covariance between binary alleles $X_i$ and $Y_i$ at sites x and y,
respectively, where $\mu_x$ and $\mu_y$ are the corresponding allele means.
We assume for ease of discussion that the data are phased; for unphased data,
z(x,y) is essentially a noisier version of the above because of cross terms.

We are primarily interested in the behavior of z(x,y) from one generation to
the next. Fix a pair of SNPs x and y at distance d and let $z_0$ denote the value of
z(x,y) at a certain point in time. After one generation, due to finite population size
and recombination, the covariance becomes [2]

$$z_1 = z_0 e^{-d} (1 - 1/2N) + \epsilon, \qquad (4)$$

where N is the population size, $e^{-d}$ is the probability of no recombination,
(1 - 1/2N) is a Bessel correction, and $\epsilon$ is a noise term with mean 0 and
variance on the order of 1/N. Iterating this equation over n generations, the
final covariance is

$$z_n = z_0 e^{-nd} e^{-n/2N_e} + \epsilon_{agg},$$

where $N_e$ is the effective population size over the interval and $\epsilon_{agg}$
is a sum of n partially decayed noise terms.
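The step from the one-generation update (4) to the n-generation form can be checked numerically. The sketch below iterates only the deterministic part of (4), ignoring the noise term and assuming a constant population size N (so that the effective size equals N); both simplifications, and the function names, are assumptions made for illustration.

```python
import math


def iterate_covariance(z0: float, d: float, N: int, n: int) -> float:
    """Apply the noise-free part of update (4) n times: z <- z * e^{-d} * (1 - 1/2N)."""
    z = z0
    for _ in range(n):
        z = z * math.exp(-d) * (1.0 - 1.0 / (2 * N))
    return z


def closed_form(z0: float, d: float, N_e: float, n: int) -> float:
    """Closed form z0 * e^{-nd} * e^{-n / 2N_e} for the expected covariance."""
    return z0 * math.exp(-n * d) * math.exp(-n / (2.0 * N_e))
```

The two agree closely when N is large, because $(1 - 1/2N)^n \approx e^{-n/2N}$. For very small N the approximation is rougher, but the drift factor still does not depend on d, which is the property the argument relies on.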

Now let time 0 denote the time of admixture between two ancestral populations
mixing in proportions α and β := 1 - α. (The bottleneck may have occurred either
before or after this point, as long as it does not influence the calculation of the
weights.) Then a little algebra shows that

$$E[z_0] = 2\alpha\beta\,\delta(x)\,\delta(y),$$

assuming the mixture is homogeneous and the distance d is large enough that
background LD can be ignored. (In practice, heterogeneity in the admixed
population changes the above form and results in the addition of an affine term to the
ROLLOFF curve, which we explicitly fit. We also typically fit only data from SNP
pairs at distance d > 0.5 cM to avoid background LD.) We can now compute the
modified ROLLOFF statistic:

$$E[R(d)] = E\left[\frac{\sum_{|x-y| \approx d} z(x,y)\, w(x,y)}{\sum_{|x-y| \approx d} w(x,y)^2}\right]
= E\left[\frac{\sum_{|x-y| \approx d} \left(z_0 e^{-nd} e^{-n/2N_e} + \epsilon_{agg}\right) \delta(x)\,\delta(y)}{\sum_{|x-y| \approx d} \delta(x)^2\,\delta(y)^2}\right]
= 2\alpha\beta\, e^{-nd} e^{-n/2N_e}.$$

Importantly, in the last step we use the fact that the combined noise term
$\epsilon_{agg}$ is uncorrelated with δ(x)δ(y). Thus, even a strong bottleneck with a
low value of $N_e$ only scales R(d) by the constant factor $e^{-n/2N_e}$, and the
$e^{-nd}$ scaling of the ROLLOFF curve as a function of d is unaffected.

On the other hand, if we use the original correlation form (1) of the ROLLOFF
statistic A(d), then the numerator still has the form of an exponential decay
$Ae^{-nd}$, but now we divide this by the norm
$\sqrt{\sum_{|x-y| \approx d} z(x,y)^2}$. In the case of a strong
bottleneck, $z(x,y) = z_0 e^{-nd} e^{-n/2N_e} + \epsilon_{agg}$ can be dominated by
the aggregate noise term $\epsilon_{agg}$. Indeed, if the bottleneck occurred k
generations ago, then the noise terms $\epsilon_i$ from the time of reduced
population size will have decayed by $e^{-kd}$ since the bottleneck but can still
have large variance if the population size $N_{bot}$ was very small at the time. In
this case, at lower values of d,
$E[z(x,y)^2] = E[(z_0 e^{-nd} e^{-n/2N_e} + \epsilon_{agg})^2]$ will be dominated by
$E[\epsilon_{agg}^2]$, which will scale approximately as $e^{-2kd}/N_{bot}$.
Hence, the denominator of A(d) will be significantly larger at low d than at high d,
causing a partial cancellation of the exponential decay of the ROLLOFF curve and
thus a downward bias in the estimated date of admixture.
NOTE S2. Simulations for estimating dates of admixture events

Simulation 1: To test the effect of founder events post admixture

In order to test the effect of founder events post admixture, we performed
simulations using the MaCS [3] coalescent simulator. We simulated data for three
populations (say, A, B and C). We set the effective population size ($N_e$) for all
populations to 12,500 (at all times except during the founder event), and the
mutation and recombination rates to $2 \times 10^{-8}$ and $1 \times 10^{-8}$ per base
pair per generation, respectively. C can be considered an admixed population that
has 60%/40% ancestry from A' and B' (the admixture time (t) was set to 30 or 100
generations before present). A' and A diverged 120 generations ago, B' and B
diverged 200 generations ago, and A and B diverged 1800 generations ago. At
generation x (x < t), C undergoes a severe founder event in which the effective
population size ($N_e$) is reduced to 5 individuals for one generation. At
generation (x+1), $N_e$ = 12,500 again. We simulated 5 replicates for each
parameter setting. We performed ROLLOFF analysis (using the original and modified
statistics) with C as the target and A and B as the reference populations. When
we use the original ROLLOFF statistic, we observe that the dates are biased
downward in cases of founder events post admixture. However, when we use the
modified statistics, the bias is removed (Table S3). Details of the bias
correction are shown in Note S1. Throughout the manuscript, we use the modified
ROLLOFF statistic (R(d)) unless specified otherwise.

Simulation 2: To test the accuracy of the modified ROLLOFF statistic 

We performed simulations using the same simulation framework as in reference [1]
to test the accuracy of the estimated dates using the modified ROLLOFF statistic.
We simulated data for 25 admixed individuals using Europeans (HapMap CEU)
and HGDP East Asians (Han) as ancestral populations, where mixture occurred
between 10 and 300 generations ago and the European ancestry proportion was set
to 20%. These ancestral populations were chosen as Fst(CEU, Han) = 0.09 is
similar to the Fst between the ancestral populations of the Roma. Figure S4
shows that we get accurate estimates for the dates of mixture up to 300
generations.

Simulation 3: To test the effect of using PCA loadings instead of allele
frequencies as weights in ROLLOFF

In the case of Roma admixture, data from unadmixed South Asian populations are
not available and so it is not possible to compute the allele frequencies of SNPs
for one ancestral population. However, data from many South Asian populations
(which are admixed with ANI and ASI ancestry) are available and can be used for
estimating the PCA-based SNP loadings. We performed the simulations described
below to mimic this scenario.

We simulated data for 60 admixed individuals using Europeans (HapMap CEU)
and HGDP East Asians (Han) as ancestral populations, where mixture occurred
100 generations ago and the European ancestry proportion was set to 30% (group 1:
n = 20), 50% (group 2: n = 20) and 70% (group 3: n = 20). These three groups of
simulated samples can be roughly considered as three South Asian populations.
We performed PCA with CEU and groups 1-3 of simulated samples to
estimate the SNP loadings that can be used in ROLLOFF.

Next, we simulated data for 54 individuals that can be used as the target in the
ROLLOFF analysis. These individuals have 80%/20% European and East Asian
ancestry respectively (similar to Roma) and the date of mixture is set to 30 (n =
27) and 100 (n = 27) generations before present. We ran the modified ROLLOFF
statistic to estimate the date of mixture in this panel of individuals using the
PCA-based loadings computed above. We estimated that the dates of mixture were 33
± 1 and 99 ± 1 generations for mixture that occurred 30 and 100 generations ago,
respectively (Figure S5). This shows that we can effectively estimate the date of
mixture even in the absence of data from unadmixed ancestral populations, as
long as data from other admixed individuals (involving the relevant ancestral
populations) are available.

Simulation 4: To test the model of two waves of admixture

In order to obtain an interpretation of the ROLLOFF estimated date of mixture
when the model assumption of a single wave of mixture is incorrect, we ran the
modified ROLLOFF statistic to infer the date of admixture on data simulated
under a double admixture scenario. We simulated data using Europeans
(HapMap CEU) and HGDP East Asians (Han) as the ancestral populations using
the simulation framework described in reference [1]. We simulated double
admixture scenarios in which a 50%/50% admixture of CEU and Han occurred at
time λ1 (shown in Table S4), followed by a 60%/40% mixture of that admixed
population and CEU at time λ2 (shown in Table S4). The mixture proportions were
chosen so that the final European ancestry proportion is ~80% (similar to Roma).
We ran modified ROLLOFF with a non-overlapping set of Europeans and Han as
the reference populations. Table S4 shows that as the interval (λ2-λ1) between the
multiple waves of mixture increases, the estimated date of mixture reflects the
date of the more recent gene flow event.
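The final ancestry proportion after two pulses follows from simple bookkeeping, sketched below. The function name is hypothetical, and which of the two second-pulse proportions applies to the previously admixed population is an assumption: a final European proportion of about 80% corresponds to 40% of the second-pulse ancestry coming from the 50%/50% admixed population and 60% from CEU.

```python
def final_european_fraction(first_pulse_european: float, from_admixed: float) -> float:
    """European ancestry after a second pulse of pure European gene flow.

    first_pulse_european: European fraction in the population formed by the first pulse.
    from_admixed: fraction of the final population drawn from that admixed population;
                  the remainder (1 - from_admixed) comes from the European source.
    """
    return from_admixed * first_pulse_european + (1.0 - from_admixed)
```

Under this reading, `final_european_fraction(0.5, 0.4)` gives 0.8, whereas drawing 60% from the admixed population would give 0.7.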
NOTE S3. Computing corrected IBD sharing distance between
Roma and Indian groups.

To find the source of the Indian ancestry in Roma, we inferred the pairwise IBD
sharing distance between Roma and various Indian groups. We observed that
the Roma have the highest IBD sharing with groups from the
northwest of India (Figure 3b). We were concerned that high IBD sharing could
be an artifact related to the high proportion of ANI ancestry in the northwestern
Indian groups. Hence, we performed a regression analysis to correct for the
effect of the ANI ancestry proportion on IBD sharing distance. The model that
provided the best fit was IBD sharing = 0.3558 + 0.8169 × ANI ancestry proportion
(P-value < 0.05). Each Indian group was considered as a single data point for
this analysis. Next, we computed an average corrected IBD sharing measure for
each region by regressing out the effect of ANI ancestry and computing the
average of the residuals for each region in India. Note: For this analysis, we did
not include the Northeastern Indian populations (Nyasha and Ao Naga) and
Andamanese populations (Onge and Great Andamanese) as these populations
do not have ANI ancestry.
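The correction step can be sketched as follows, using the fitted intercept and slope reported above. The input format (one tuple of region, ANI proportion and observed IBD sharing per Indian group) and the function name are assumptions for illustration, and the values in the usage note are invented, not data from the study.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

INTERCEPT = 0.3558  # fitted model reported in the text
SLOPE = 0.8169


def corrected_sharing_by_region(
    groups: Iterable[Tuple[str, float, float]],
) -> Dict[str, float]:
    """Average regression residual of IBD sharing per region.

    groups: (region, ani_proportion, ibd_sharing), one entry per Indian group.
    """
    residuals = defaultdict(list)
    for region, ani, ibd in groups:
        predicted = INTERCEPT + SLOPE * ani
        residuals[region].append(ibd - predicted)
    return {region: mean(vals) for region, vals in residuals.items()}
```

For instance, a hypothetical group with ANI proportion 0.70 and IBD sharing 1.00 has a predicted sharing of 0.3558 + 0.8169 × 0.70 ≈ 0.928 and therefore a residual of about 0.072. A positive regional average means that region shares more IBD with Roma than its ANI ancestry alone would predict.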

In order to control for the effect of the sample size on the IBD computation, we 
performed bootstrap analysis such that for each run, we randomly sampled up to 
30 individuals (some groups had < 30 samples) from each of the 8 Indian groups 
and estimated the IBD sharing statistics between Roma and the Indian groups. 
We performed a total of 100 runs and obtained the mean and standard error of 
the IBD statistic (Figure S7). We observed that Roma still share the highest
proportion of IBD segments with groups from the northwest of India.
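The resampling scheme can be sketched as below for a single Indian group. The input (one average IBD sharing value with Roma per individual), sampling without replacement, and the use of the standard deviation across runs as the standard error are assumptions made for illustration; the function name is hypothetical.

```python
import random
from statistics import mean, stdev
from typing import Sequence, Tuple


def resampled_ibd(
    per_individual_sharing: Sequence[float],
    n_max: int = 30,
    n_runs: int = 100,
    seed: int = 0,
) -> Tuple[float, float]:
    """Mean and spread of the group-level IBD statistic across resampling runs.

    Each run draws up to n_max individuals without replacement and averages
    their sharing values; groups smaller than n_max are used in full.
    """
    rng = random.Random(seed)
    k = min(n_max, len(per_individual_sharing))
    run_means = [
        mean(rng.sample(list(per_individual_sharing), k)) for _ in range(n_runs)
    ]
    return mean(run_means), stdev(run_means)
```

For a group with fewer than 30 individuals every run uses all samples, so the spread across runs is zero.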
NOTE S4. Simulations for estimating date of founder event.

We used the MaCS [3] coalescent simulator to perform simulations to test the
robustness of the allele sharing statistic that we use for estimating the date of
the founder event. We simulated data for two populations (say, A and B) that
diverged 1800 generations ago. We set the effective population size for both
populations to $N_e$ = 12,500, the mutation rate to $2 \times 10^{-8}$ and the
recombination rate to $1 \times 10^{-8}$ per base pair per generation. For each
simulation, we compute the autocorrelation of allele sharing within B, and then
subtract the across-population autocorrelation between A and B to remove the
effects of ancestral allele sharing.

Simulation 1: Founder event only

Pop B undergoes a severe founder event x generations ago in which the effective
population size is reduced to 5 individuals for one generation. At generation (x+1),
the population size is $N_e$ again. Table S5 shows that we can accurately estimate
the date of the founder event using our statistic.

Simulation 2: Founder event and admixture

We simulate data for a more complex demography where B is admixed and has
40% ancestry from A', which is closely related to A. The admixture occurred at
time t, and at time x = 10, 30 or 100 generations, B undergoes a severe founder
event in which the effective population size of B is reduced to 5 individuals for
one generation. Table S5 shows that for a recent founder event (10 and 30
generations ago), we accurately estimate the date of the founder event.
However, for older founder events (100 generations), we are unable to accurately
estimate the date of the founder event if it occurred pre-admixture. This
is expected, as we are only sampling the admixed population and not the
ancestral population that underwent the founder event.
Simulation 3: No founder event

We simulate data for a complex demography where B is admixed and has 40%
ancestry from A', which is closely related to A. The admixture occurred either 10,
30, 50 or 70 generations ago. In all cases, we observe that the allele sharing
statistic is not associated with genetic distance. We test whether the model of a
straight line ($y \sim c$) or an exponential decay ($y \sim c + Ae^{-2Dt}$, where
D = genetic distance and t = time of founder event) is a better fit to the output.
In all four cases, we fail to reject the null model ($y \sim c$) (P > 0.05).
Figure S1. ADMIXTURE Analysis of Roma and HapMap3 populations.

Results for ADMIXTURE analysis for K=2 to K=7. Each vertical line represents 
an individual colored in proportion to their estimated ancestry within each cluster. 
Figure S2. Estimating the proportion of Eurasian and South Asian ancestry
in Roma. In order to estimate the proportion of West Eurasian ancestry in Roma,
we use the phylogenetic tree shown below. The different colored lines show drift
that has occurred between the populations connected by the line. The orange
line shows the drift between YRI and Adygei (a population from the Caucasus)
and the red and green lines show the drift separating Roma and Onge. m
denotes the shared drift between Roma and Onge. See Methods for details on
estimating the West Eurasian ancestry proportion (p) in Roma. This figure is
adapted from reference [4].
Figure S3. Normalization term from original ROLLOFF correlation
coefficient formulation. We plot the squared normalization term
$\sum_{|x-y| \approx d} z(x,y)^2$ as
a function of genetic distance d between SNPs for the admixture plus bottleneck
scenarios described in Table S3, using either the correlation (a) or covariance (b)
versions of z(x,y). In the case of no bottleneck, the normalization term is
dominated by finite sampling noise and exhibits no dependence on d. For the
cases of a strong bottleneck post-admixture, however,
$\sum_{|x-y| \approx d} z(x,y)^2$ exhibits an
exponential decay $Ae^{-2kd} + c$ with rate constant approximately equal to twice the
age of the bottleneck (best-fit k = 15, 25, 46, 65, 83 (a) and k = 12, 20, 41, 60, 78
(b) shown as solid lines).

(a) Using z(x,y) = correlation(x,y) 




(b) Using z(x,y) = covariance(x,y) 
Figure S4. ROLLOFF Simulation Results: Variable age of mixture. We
simulated data for 25 admixed individuals with mixed European and East Asian
ancestry where the proportion of European ancestry was set to 20% and set the
admixture date between 10 and 300 generations (as shown below). We ran the
modified ROLLOFF statistic to estimate the date of mixture using allele
frequencies in an independent dataset of French and East Asians. Standard
errors were computed using weighted block jackknife as described in the
Methods.
Figure S5. ROLLOFF Simulation using PCA-loadings. We simulated data for 54 individuals with mixed European and
East Asian ancestry where the proportion of European ancestry was set to 80% (similar to Roma) and the mixture occurred 
30 generations ago (left panel: n = 27) and 100 generations ago (right panel: n = 27). We ran ROLLOFF to estimate the 
date of mixture in this panel of individuals using the PCA-based loadings computed above. We estimated that the dates of 
mixture were 33 ± 1 and 99 ± 4 generations (the true dates were 30 and 100). 



True Date of admixture = 30 gens True Date of admixture = 100 gens 
Figure S6. IBD Sharing of Roma with host European populations. We
computed average pairwise IBD sharing between Roma from each geographical
region and Europeans from that region (non-Roma European individuals from the
countries in which the Roma were sampled).
Figure S7. Bootstrap analysis to compute error in IBD statistics. We
performed bootstrap analysis where we randomly sample up to 30 individuals
from each of the 8 Indian groups and compute the IBD sharing statistics between
Roma and the Indian groups. We performed a total of 100 runs and obtained the
mean and standard error of the IBD statistic (vertical bars shown below). For
Indian groups which had < 30 samples (such as Northeast, Southwest, East and
Andamanese), all samples were included in each run and so no standard errors
are shown.
Table S1. Average frequency differentiation (Fst) for Roma and HapMap populations

        CEU     YRI     CHB     JPT     ASW     CHD     GIH     LWK     MEX     MKK     TSI     Roma
CEU     -       0.14    0.102   0.104   0.088   0.103   0.033   0.13    0.036   0.093   0.003   0.016
YRI     0.14    -       0.169   0.17    0.008   0.169   0.129   0.007   0.134   0.025   0.136   0.135
CHB     0.102   0.169   -       0.007   0.127   0.001   0.071   0.159   0.064   0.131   0.102   0.092
JPT     0.104   0.17    0.007   -       0.129   0.008   0.072   0.161   0.065   0.133   0.104   0.094
ASW     0.088   0.008   0.127   0.129   -       0.128   0.083   0.009   0.088   0.013   0.086   0.087
CHD     0.103   0.169   0.001   0.008   0.128   -       0.071   0.16    0.066   0.132   0.103   0.093
GIH     0.033   0.129   0.071   0.072   0.083   0.071   -       0.119   0.038   0.086   0.032   0.026
LWK     0.13    0.007   0.159   0.161   0.009   0.16    0.119   -       0.125   0.015   0.126   0.125
MEX     0.036   0.134   0.064   0.065   0.088   0.066   0.038   0.125   -       0.093   0.037   0.04
MKK     0.093   0.025   0.131   0.133   0.013   0.132   0.086   0.015   0.093   -       0.088   0.089
TSI     0.003   0.136   0.102   0.104   0.086   0.103   0.032   0.126   0.037   0.088   -       0.015
Roma    0.016   0.135   0.092   0.094   0.087   0.093   0.026   0.125   0.04    0.089   0.015   -
Table S2. Formal tests of admixture

                                   Z-score for 4 Population test                      Estimated West
Population  Samples               (P_CEU-P_YRI)   (P_YRI-P_Onge)   (P_X-P_YRI)       Eurasian Ancestry %
Roma        18        Hungary     -33             4.8              -29.3             78.3 ± 1.9%
Roma*       3         Slovakia    -26.6           3.5              -22.8             71.5 ± 3.1%
Roma**      1         Romania     -20.2           0.7              -19.2             79.4 ± 4.7%
Roma        2         Spain       -25.3           0.9              -24               75.6 ± 4.0%
Roma        24        Combined    -33             4.8              -29.5             77.5 ± 1.8%

NOTE: * indicates that some samples from the group appear to have recent European gene flow. These
samples were excluded from the analysis (the number of * indicates the number of samples excluded).
Ancestry proportions were estimated based on f4 Ratio Estimation using Yoruba, Adygei, Europeans (CEU)
and Onge as the reference populations.
Table S3. Simulations for estimating dates of admixture events: Founder events post admixture model

True date of     True date of      Date based on original    Date based on modified    Date based on modified
admixture (t)    founder           ROLLOFF statistic (a)     ROLLOFF statistic (b)     ROLLOFF statistic (c)
                 event (x)
30               N/A               31.3                      32.0                      32.1
30               5                 24.6                      30.1                      29.0
30               10                27.7                      34.1                      32.3
30               20                23.3                      32.7                      31.0
30               25                23.4                      30.8                      29.5

100              N/A               94.1                      96.8                      97.0
100              10                93.9                      106.1                     102.9
100              20                87.1                      102.7                     97.3
100              40                75.3                      95.6                      92.2
100              60                83.9                      106.3                     102.8
100              100               81.6                      101.1                     99.0

Note: We simulated data from three populations, Pop A (n = 20), Pop B (n = 20) and Pop C (n = 30), using the MaCS coalescent simulator. Populations A and B
diverged 1800 generations ago. The effective population size for all populations was set to 12,500 at all times (except during the founder event). The mutation and
recombination rates were set to $2 \times 10^{-8}$ and $1 \times 10^{-8}$ per base pair per generation. Pop C can be considered an admixed population that has 60%/40%
ancestry from A' and B' (admixture time (t) is set to 30 or 100 generations). Pop A' and A diverged 120 generations ago and B' and B diverged 200 generations ago. At
generation x (shown in the table above), Pop C undergoes a severe founder event in which the effective population size is reduced to 5 individuals for one generation.
When x = N/A, there was no founder event. We performed ROLLOFF (using the original and modified statistics) with Pop C as the target and Pop A and B as the
reference populations. We performed 5 replicates for each parameter and report the average estimated date of mixture. The statistics used were:

(a) Original ROLLOFF statistic: $A(d) = \frac{\sum_{|x-y| \approx d} z(x,y)\, w(x,y)}{\sqrt{\sum_{|x-y| \approx d} z(x,y)^2 \, \sum_{|x-y| \approx d} w(x,y)^2}}$, where z(x,y) = correlation between x and y.

(b) Modified statistic: $R(d) = \frac{\sum_{|x-y| \approx d} z(x,y)\, w(x,y)}{\sum_{|x-y| \approx d} w(x,y)^2}$, where z(x,y) = correlation between x and y.

(c) Modified statistic: $R(d) = \frac{\sum_{|x-y| \approx d} z(x,y)\, w(x,y)}{\sum_{|x-y| \approx d} w(x,y)^2}$, where z(x,y) = covariance between x and y.
Table S4. Simulations for estimating dates of admixture events: Two gene
flow model

Date of first wave    Date of second wave    Estimated date in generations
of mixture (λ1)       of mixture (λ2)        (± standard error)
120                   20                     36 ± 3
170                   20                     28 ± 2
220                   20                     23 ± 2
270                   20                     24 ± 2
320                   20                     25 ± 1
370                   20                     25 ± 1
420                   20                     22 ± 1

130                   30                     46 ± 3
180                   30                     47 ± 3
230                   30                     41 ± 2
280                   30                     39 ± 2
330                   30                     39 ± 3
380                   30                     35 ± 2
430                   30                     32 ± 3

Note: We simulated 27 individuals using CEU and Han Chinese as the ancestral populations
where we set the overall European ancestry proportion to be 80%. We then performed
ROLLOFF analysis using the modified statistic with an independent dataset of Europeans
(HGDP French) and East Asians (HapMap CHB) as reference populations.
Table S5. Simulations for estimating dates of founder events

Simulation scenario          True date of         Date of          Estimated date of
                             founder event (x)    admixture (t)    founder event
                                                                   (in generations)
Founder event only           10                   -                11.2
                             20                   -                20.8
                             40                   -                39.3
                             60                   -                52.7
                             80                   -                74.9
                             100                  -                95.7

Founder event + Admixture    10                   10               8.2
                             10                   20               8.4
                             10                   40               8.3
                             10                   60               9.2
                             10                   80               11.8
                             10                   100              9.9
                             30                   10               24.4
                             30                   20               29.9
                             30                   30               30.1
                             30                   40               26.5
                             30                   60               26.2
                             30                   80               27.9
                             30                   100              27.6
                             100                  10               50
                             100                  20               60.9
                             100                  40               67.4
                             100                  60               81.5
                             100                  80               113.3
                             100                  100              92.7
                             100                  150              85.3

Note: We simulated 20 individuals from Pop A and 25 individuals from Pop B using the MaCS coalescent
simulator. The two populations diverged 1800 generations ago. The effective population size for both
populations was set to 12,500 at all times (except during the founder event). The mutation and recombination
rates were set to $2 \times 10^{-8}$ and $1 \times 10^{-8}$ per base pair per generation. During the founder event, the effective
population size was reduced to 5 individuals for one generation at the date specified in the table above. For each
simulation we generated data for ~450,000 polymorphic sites. SNPs with minor allele frequencies of <1%
were discarded.